
 cluster membership




exp(f(C i, C

Neural Information Processing Systems

We sincerely thank all reviewers for their valuable comments. Our responses to the comments are listed below. If we understand it correctly, "the approximate posterior in If so, we do consider the effects of the KL term in the proof. Dir(β), which is a Dirichlet posterior with a Dirichlet prior Dir(α). Does Claim 4.1 rely on a specific distribution (q1 under Weaknesses)?


Dirichlet Graph Variational Autoencoder

Neural Information Processing Systems

Graph Neural Networks (GNNs) and Variational Autoencoders (VAEs) have been widely used in modeling and generating graphs with latent factors. However, there is no clear explanation of what these latent factors are and why they perform well.


Clustering Approaches for Mixed-Type Data: A Comparative Study

arXiv.org Machine Learning

Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing approaches are suited for this task. This study reviews the state of the art among these approaches and compares them using various simulation models. The compared methods include the distance-based approaches k-prototypes, PDQ, and convex k-means, and the probabilistic methods KAy-means for MIxed LArge data (KAMILA), the mixture of Bayesian networks (MBNs), and the latent class model (LCM). The aim is to provide insights into the behavior of the different methods across a wide range of scenarios by varying experimental factors such as the number of clusters, cluster overlap, sample size, dimension, proportion of continuous variables in the dataset, and the clusters' distribution. The degree of cluster overlap, the proportion of continuous variables in the dataset, and the sample size have a significant impact on the observed performance. When strong interactions exist between variables alongside an explicit dependence on cluster membership, none of the evaluated methods demonstrated satisfactory performance. In our experiments, KAMILA, LCM, and k-prototypes exhibited the best performance with respect to the adjusted Rand index (ARI). All the methods are available in R.
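The abstract above ranks methods by the adjusted Rand index (ARI). As a rough illustration of what that score measures, here is a minimal pure-Python sketch of the standard pair-counting ARI formula. The function name `adjusted_rand_index` is hypothetical and not taken from the study, which reports its methods as available in R.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same observations.

    Illustrative sketch only; not the implementation used in the study.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same observations")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one observation: the labelings trivially agree

    # Contingency-table cells and their row/column margins.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Pairs placed together in both labelings, and in each one separately.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)
```

The score is 1 when the two partitions coincide up to a relabeling of the clusters, is close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.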



Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. Summary: This paper gives provable guarantees on the conditions under which the convex optimization procedure (COP) recovers the correct clustering. The main result is: if the samples are drawn from two cubes, each being a cluster, then COP obtains the correct clustering provided the distance between the two cubes is larger than a threshold that depends linearly on the cube size and on the ratio of the numbers of samples in the two clusters. The proof is based on the idea of lifting, which maps the problem into a higher-dimensional space where the original formulation takes a separable form (the regularization term becomes a sum of the l_2 norms of the rows). After the optimal dual solution is constructed through some algebraic operations, the optimal primal solution can be obtained.
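The review does not state the objective itself. As a sketch of the lifting idea it describes, assume the commonly used convex clustering objective with a pairwise l_2 fusion penalty. The symbols x_i (samples), u_i (per-sample centroids), v_{ij} (lifted difference variables) and \lambda (regularization weight) are assumptions introduced here for illustration, not notation confirmed by the paper.

```latex
% Assumed objective (not quoted from the paper or the review):
\min_{U \in \mathbb{R}^{n \times d}} \;
  \frac{1}{2} \sum_{i=1}^{n} \lVert x_i - u_i \rVert_2^2
  + \lambda \sum_{i<j} \lVert u_i - u_j \rVert_2

% Lifted form: one extra row v_{ij} per pair, so the regularizer
% becomes a separable sum of row-wise l_2 norms.
\min_{U,\,V} \;
  \frac{1}{2} \sum_{i=1}^{n} \lVert x_i - u_i \rVert_2^2
  + \lambda \sum_{i<j} \lVert v_{ij} \rVert_2
\quad \text{subject to} \quad v_{ij} = u_i - u_j \;\; \text{for all } i < j
```

Under this assumption, two samples fall in the same cluster exactly when their optimal centroids coincide, that is, when the corresponding row v_{ij} is zero.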


p(A | Z) = ξ null

Neural Information Processing Systems

We sincerely thank all reviewers for their valuable comments. Our responses to the comments are listed below. If we understand it correctly, "the approximate posterior in If so, we do consider the effects of the KL term in the proof. Does Claim 4.1 rely on a specific distribution (q1 under As for Gaussian variables (the case in VGAE), we are not sure if the claim still holds. We agree sometimes it might be hard to follow and we shall add the details in the revision.


A Computational Approach to Improving Fairness in K-means Clustering

arXiv.org Artificial Intelligence

Clustering is an important problem in data mining. It aims to split the data into groups such that data points in the same group are similar while points in different groups are dissimilar under a given similarity metric. Clustering has been successfully applied in many practical settings, such as data grouping in exploratory data analysis, search results categorization, and market segmentation. Clustering results are often used for further analysis or interpretation. However, directly applying the results of standard clustering algorithms may raise fairness issues: some clusters may favor data points from one of the subpopulations, i.e., contain disproportionately many of its points. Figure 1: Illustration of the fairness issue in clustering. Points of different colors indicate different values of a sensitive variable, e.g., gender, where blue indicates male and red female. Cluster 1 is dominated by females and Cluster 2 by males. Points with an arrow are those whose cluster membership might be switched to make the clusters less dominated by one subpopulation.
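To make the notion of a cluster "dominated" by one subpopulation concrete, the following minimal sketch computes the share of each sensitive group inside every cluster and compares it with the overall share. The helper names `cluster_composition` and `max_deviation` are hypothetical, and the deviation measure is one simple choice for illustration, not the fairness criterion or algorithm of the paper.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_ids, groups):
    """Return {cluster: {group: share of that group within the cluster}}."""
    counts = defaultdict(Counter)
    for cluster, group in zip(cluster_ids, groups):
        counts[cluster][group] += 1
    return {
        cluster: {g: k / sum(cnt.values()) for g, k in cnt.items()}
        for cluster, cnt in counts.items()
    }


def max_deviation(cluster_ids, groups):
    """Largest gap between a group's share in a cluster and its overall share.

    Illustrative measure only (an assumption, not the paper's criterion):
    0 means every cluster mirrors the overall population mix.
    """
    n = len(groups)
    overall = {g: k / n for g, k in Counter(groups).items()}
    composition = cluster_composition(cluster_ids, groups)
    return max(
        abs(composition[cluster].get(g, 0.0) - share)
        for cluster in composition
        for g, share in overall.items()
    )


# Toy data: cluster 1 is mostly "f", cluster 2 mostly "m".
clusters = [1, 1, 1, 2, 2, 2]
gender = ["f", "f", "m", "m", "m", "f"]
print(cluster_composition(clusters, gender))
print(max_deviation(clusters, gender))
```

Switching the membership of a few boundary points, as the figure caption suggests, would lower this deviation at some cost in cluster compactness.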


Clustering by Nonparametric Smoothing

arXiv.org Machine Learning

A novel formulation of the clustering problem is introduced in which the task is expressed as an estimation problem, where the object to be estimated is a function which maps a point to its distribution of cluster membership. Unlike existing approaches which implicitly estimate such a function, like Gaussian Mixture Models (GMMs), the proposed approach bypasses any explicit modelling assumptions and exploits the flexible estimation potential of nonparametric smoothing. An intuitive approach for selecting the tuning parameters governing estimation is provided, which allows the proposed method to automatically determine both an appropriate level of flexibility and also the number of clusters to extract from a given data set. Experiments on a large collection of publicly available data sets are used to document the strong performance of the proposed approach, in comparison with relevant benchmarks from the literature. R code to implement the proposed approach is available from https://github.com/DavidHofmeyr/

I. Introduction

Cluster analysis refers to the task of partitioning a set of data into groups (or clusters) in such a way that points within the same cluster tend to be more similar than points in different clusters.